Public link to this page

Introduction

Graded task Logistic regression

Your task is to create a logistic regression model and to predict whether the patient has 10-year risk of future coronary heart disease (CHD). The data is given in the sheet Heart_desease. The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts.

Requirements: 1. Create a logistic regression model and estimate predictions. 2. Interpret models’ coefficients and Odd ratios. 3. Assess the model with performance metrics.”

Evaluation criteria: “1. How well the model creation in the task description is followed. 2. You can explain how you selected the best model. 3. You can interpret the models’ performance metrics and parameters. 4. You present fluent and clearly. Presentation should take 10-15 minutes. 5. Analytical approach to the problem”

Sample questions: “1. How have you selected your best model? 2. Why have you used the particular model’s performance metrics? 3. What is Odds ratio? 4. How do you interpret regression model’s intercept? 5. How do you interpret regression model’s coefficients?”

Column description

  • Sex: male or female(Nominal)
  • Age: Age of the patient;(Continuous - Although the recorded ages have been truncated to whole numbers, the concept of age is continuous)
  • Current Smoker: whether or not the patient is a current smoker (Nominal)
  • Cigs Per Day: the number of cigarettes that the person smoked on average in one day.(can be considered continuous as one can have any number of cigarettes, even half a cigarette.)

Medical( history)

  • BP Meds: whether or not the patient was on blood pressure medication (Nominal)
  • Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
  • Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
  • Diabetes: whether or not the patient had diabetes (Nominal)
  • Tot Chol: total cholesterol level (Continuous)
  • Sys BP: systolic blood pressure (Continuous)
  • Dia BP: diastolic blood pressure (Continuous)
  • BMI: Body Mass Index (Continuous)
  • Heart Rate: heart rate (Continuous - In medical research, variables such as heart rate though in fact discrete, yet are considered continuous because of large number of possible values.)
  • Glucose: glucose level (Continuous)

Predict variable (desired target) - 10 year risk of coronary heart disease CHD (binary: “1”, means “Yes”, “0” means “No”)

Abbreviations: - EDA : Exploratory Data Analysis - EV : Explanatory Variable (i.e. predictor)

Load libraries and data

library(tidyverse)
library(janitor)
library(emmeans)
library(dlookr)
library(flextable)
library(jtools)
library(GGally)
library(vcd)
library(ggstatsplot)
library(gridExtra)
library(caret)
library(jtools)
library(kableExtra)
library(plotly)
library(highcharter)
library(pROC)
   

select <- dplyr::select

bd <- "/Users/leonardo/My Drive/turing_college/Modules/Linear_and_Logistic_Regression/"

df <- read_csv(paste0(bd,"Graded_task_Heart_disease_data.csv")) %>% 
 tibble %>% 
 janitor::clean_names()

# change appropriate variables to factor
df <- df %>% mutate_at(
 c("male","education","current_smoker","bp_meds",
   "prevalent_stroke","prevalent_hyp",
   "diabetes","ten_year_chd"), factor
)

# df %>% glimpse()

# dlookr::diagnose(df) %>% flextable

Split train/test

To estimate the perfomance of the model with out-of-sample data, we split the whole dataset in a train (60% - 2543 data points) and test (40% - 1695 data points) set.

set.seed(124)
N = nrow(df)
split_index <- sample(c("train","test"), N, replace = T, prob = c(0.6,0.4))

train <- df[which(split_index == "train"),]  # train set
test <- df[which(split_index == "test"),]  # test set

Outliers check

We will carry out the check for outliers - and subsequently evaluate potential imputation for NA - only on the train data, and blindly apply this to the test. In a real world situation, the test set can grow continously, therefore we need to take decisions only on the train data.

The evaluation for potential NA imputation will be carried out later, just before modelling. The reason for this is that carrying out NA imputation first would bias the EDA.

Observations All the numerical values are in the norm, except for blood pressure. Specifically, the systolic blood pressure sys_bp is way too high, with about 4% of the participants having a bp > 180 (requiring immediate medical care).

However, in published papers on the Framinghton studies these values are accepted as plausible. We will therefore only exclude values above 250 mmHg.

# train %>% select_if(is.numeric) %>% summary()

# boxplot(glucose ~ diabetes, data = train, main = "glucose ~ diabetes")

boxplot(sys_bp ~ bp_meds, data = train, main = "systolic blood pressure ~ bp_meds")

boxplot(sys_bp ~ prevalent_hyp, data = train, main = "systolic blood pressure ~ hypertension")

# percent of people with sys_bp > 180
pct_high_sys_bp <- round((train$sys_bp > 180) %>% sum() / nrow(train) * 100,2)

# xtabs(~ prevalent_hyp + bp_meds, data = train)

train <- train %>% filter(sys_bp <= 250)
test <- test %>% filter(sys_bp <= 250)

EDA of numerical variables

Initially, we carry out t-tests to discover which EVs exhibit (corrected) significant differences between people with and without risk of CHD.

Then we will also examine the potential correlations between variables. If the latter are substantial, we will evaluate the possibility of discarding some EVs due to collinearity.

Two-independent-samples t-tests

(an excercise in fitting many models using purrr::map)

library(tidyr)

ttests <- train %>%
   # select("ten_year_chd", "age", "glucose") %>%
   
   select(-c("education","male","current_smoker","bp_meds",
           "prevalent_stroke","prevalent_hyp", "diabetes")) %>% 
   na.omit() %>%
   
   # pivot longer to prepare the nesting
   pivot_longer(
      cols = -ten_year_chd, 
      names_to = "variable", values_to = "value"
   ) %>% 
   group_by(variable) %>% 
   group_nest() %>%
   
   # fit a model to each nested df
   mutate(
    ttmod = map(data, ~ lm(value ~ ten_year_chd, data = .x))
   ) %>% 

   # get the stats
   mutate(
      tidy = map(ttmod, broom::tidy)
   ) %>% 
   unnest(tidy) %>% 
   filter(term == "ten_year_chd1") %>% 
   
   # correct for multiple comparisons
   mutate(adj_pval = p.adjust(p.value, method = "fdr")) %>% 
   # filter(adj_pval <= 0.01) %>% 
   relocate(adj_pval, .after=variable) %>% 
   arrange(adj_pval)


ttests %>%
  select(variable, statistic, adj_pval) %>%
  rename(`t value` = statistic) %>%
  flextable() %>%
  set_formatter(adj_pval = function(x) sprintf("%.2e", x))


Age

ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = age
)

Systolic blood pressure

ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = sys_bp
)

Diastolic blood pressure

# 
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = dia_bp
)

Glucose

# 
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = glucose
)

BMI

# 
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = bmi
)

Cholesterol level

# 
ggbetweenstats(
   data = train,
   x = ten_year_chd,
   y = tot_chol
)

Correlation matrix

Exploring the correlation matrix of the numerical variables to detect potential collinearity

train %>% 
 # select(-c("education","male","current_smoker","bp_meds",
 #           "prevalent_stroke","prevalent_hyp", "diabetes")) %>%
 select(ten_year_chd, age, sys_bp, dia_bp, bmi, glucose, tot_chol) %>% 
 na.omit() %>% 
 # mutate_at(vars(-ten_year_chd), ~ log(. + 1)) %>% 
 GGally::ggpairs(
    aes(color = ten_year_chd),
    upper = list(continuous = wrap("cor", size = 3)),
    lower = list(continuous = wrap("points", alpha = 0.7, size = 1)),
    diag = list(continuous = wrap("densityDiag", alpha = 0.5))
 )

Observations - numerical EVs

  • age appears to be the most discriminant variable for ten year CHD prediction. Other EVs which are significantly different (after correction with FDR) in people with and without CHD risk are bmi, sys_bp, dia_bp, glucose, tot_chol.

  • the systolic sys_bp and diastolic dia_bp blood pressure are correlated with each other - as expected. Since CHD is especially related to hypertension, we will only use systolic blood pressure

  • bmi and age are also correlated with blood pressure, however they appear to be important variables for CHD and their correlation with blood pressure is mild, so we will keep them

  • NB: the number of cigarettes per day (cigs_per_day) will be explored as a categorical variable (see below) with levels for tens of cigarettes per day

EDA of categorical variables

# train %>% select_if(is.factor) %>% summary

As before for the numerical variables, we will first assess a dependence between risk of CHD and other categorical variables, and then explore the reciprocal relationships of all categorical variables to assess the presence of collinearity.

Chi-square tests

Sex (male)

library(ggstatsplot)

ggbarstats(
   data = train %>% select(ten_year_chd, male),
   x = male,
   y = ten_year_chd,
   label = "both"
)

Hypertension

ggbarstats(
   data = train %>% select(ten_year_chd, prevalent_hyp),
   x = prevalent_hyp,
   y = ten_year_chd,
   label = "both"
)

Cigarettes per day

The values are too sparse to treat this as a continous variable. We will therefore group the # cigs per day in tens and treat this as a categorical variable.

A first visual exploration shows that there appear to be mild differences in the proportion of people smoking different amount of cigarettes per day.

When carrying out a chi-square test of independence, we are warned that the results might be incorrect. This appears to be due to the fact that there are too few observations in the category for 50 and 60 cigarettes per day. When these levels are removed, a significant (p = 0.0066) dependence is shown between risk group and amount of cigarettes per day.

# Suppress summarise info
options(dplyr.summarise.inform = FALSE)

train %>% 
   select(ten_year_chd, cigs_per_day) %>% 
   na.omit() %>% 
   mutate(cigs_factor = round(cigs_per_day/10,0)) %>% 
   group_by(ten_year_chd, cigs_factor) %>% 
   summarise(count = n(), .group = "drop") %>% 
   group_by(ten_year_chd) %>%
   mutate(prop = count / sum(count)) %>% 
   select(ten_year_chd, cigs_factor, prop) %>%
   ggplot(aes(x = factor(cigs_factor), y = prop, fill = ten_year_chd)) +
   geom_bar(stat = "identity", position = "dodge") + 
   labs(x = "Tens of Cigarettes per Day", y = "Proportion in each group", fill = "CHD Status")

train %>% 
   select(ten_year_chd, cigs_per_day) %>% 
   mutate(tens_cigarettes = round(cigs_per_day/10, 0) ) %>% 
   filter(tens_cigarettes < 5) %>% 
   select(-cigs_per_day) %>% 
   na.omit() %>% 
   xtabs(~ ten_year_chd + tens_cigarettes, data = .) %>% 
   chisq.test()  %>% tidy() %>% kable(format = "html") %>%
   add_header_above(c("Chi-Square CHD Risk ~ Tens_of_cigarettes" = 4)) %>%
   kable_styling()
Chi-Square CHD Risk ~ Tens_of_cigarettes
statistic p.value parameter method
14.22453 0.0066119 4 Pearson’s Chi-squared test

Current smoker

ggbarstats(
   data = train %>% select(ten_year_chd, current_smoker),
   x = current_smoker,
   y = ten_year_chd,
   label = "both"
)

Blood pressure medication

ggbarstats(
   data = train %>% select(ten_year_chd, bp_meds),
   x = bp_meds,
   y = ten_year_chd,
   label = "both"
)

Education

ggbarstats(
   data = train %>% select(ten_year_chd, education),
   x = education,
   y = ten_year_chd,
   label = "both"
)

Association matrix (categorical)

Graphically display the association between binary categorical variables, including ten_year_chd to assess the presence of correlated predictors

library(vcd)

train_cat <- train %>% select_if(is.factor) %>% select(-c("ten_year_chd","education"))

pairs(
   table(train_cat), 
   diag_panel = pairs_diagonal_mosaic(offset_varnames=-2.5),    #move down variable names
   upper_panel_args = list(shade = TRUE),                       #set shade colors
   lower_panel_args = list(shade = TRUE)
)

Observations - categorical EVs

  • The 10 year risk of CHD is more prevalent in males than in females
  • Not surprisingly, hypertension and assumption of blood pressure medications is also associated with risk. We will also examine the latter, although only ~ 3% of the sample takes these medications
  • The number of cigarettes smoked per day (in groups of 10) also increaes the risk of CHD. Interestingly, still, more than 50% of the people who are at risk do not smoke. It would be interesting to have data about the quality of the air of the place where they live, or whether their partner is a smoker.
  • It is also interesting that the relationship between amount of cigarettes per day and risk is not linear: there are more people at risk among those who smoke 20-30 or 40-50 cigs per day. while among those who smoke 10-20 or 30-40 cigs per day the proportion of people with and without risk is almost 50%. This strange behaviour can probably be - at least in part - explained by the fact that smokers tend to underestimate the amount of cigarettes smoked.
  • Instead, the fact of being a current smoker does not appear to be a risk factor
  • There is an interaction between being make, having hypertension and being a current smoker. We will examine whether this interaction can increase the prediction, although the fact of currently smoking per se apparently does not.

Choice of predictors

Numerical variables Age (age), High blood pressure (sys_bp), body-mass index (bmi), glucose blood level (glucose), cholesterol blood level (tot_chol) appear to be involved in predicting the risk of CHD. As noted above, cigs_per_day was treated as a categorical variables with levels for o and multiples of 10.

Categorical variables Sex (male), hypertension (prevalent_hyp), blood pressure medications (bp_meds), education (education) and number of cigarettes per day (cigs_per_day) appear to be involved in predicting the risk of CHD.

There are widespread associations between numerical and categorical variables. This is not surprising for two reasons:

  • some variables are proxy each other (e.g. sys_bp, prevalence_hyp, bp_meds)
  • some variables are linked by known relationships: e.g. with age, the bmi increases, and both are likely to lead to increased blood pressure sys_bp, which in turn can lead to hypertension and relative medicaments

Despite this, we will enter all the selected variables in the model and assess the presence of collinearity using VIF.

age

library(gridExtra)

boxplot_tt <- function(response, predictor) {
   tt <- t.test(train[[response]] ~ train[[predictor]], data =train)
   
   res <- paste0("t = ", round(tt$statistic,2), " p = ", round(tt$p.value,4))
   
   p <- train %>% na.omit() %>% 
      ggplot(aes(x = .data[[predictor]], y = .data[[response]])) +
      geom_boxplot() +
      labs(title = paste0(predictor," vs ", response),
           subtitle = res)
   
   return(p)
   
}

p1 <- boxplot_tt("age", "male")
p2 <- boxplot_tt("age", "prevalent_hyp")
p3 <- boxplot_tt("age", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

bmi

p1 <- boxplot_tt("bmi", "male")
p2 <- boxplot_tt("bmi", "prevalent_hyp")
p3 <- boxplot_tt("bmi", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

hypertension (sys_bp)

p1 <- boxplot_tt("sys_bp", "male")
p2 <- boxplot_tt("sys_bp", "prevalent_hyp")
p3 <- boxplot_tt("sys_bp", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

glucose level

p1 <- boxplot_tt("glucose", "male")
p2 <- boxplot_tt("glucose", "prevalent_hyp")
p3 <- boxplot_tt("glucose", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

cholesterol level

p1 <- boxplot_tt("tot_chol", "male")
p2 <- boxplot_tt("tot_chol", "prevalent_hyp")
p3 <- boxplot_tt("tot_chol", "bp_meds")

grid.arrange(p1, p2, p3, nrow = 1)

NA Assessment

As a last step before fitting the models, we need to take care of NAs in these 6 variables. As mentioned in the beginning, we do this here in order not to bias the EDA.

In case of imputation, we will apply the same procedure blindly to the test set.

bmi is significantly different for males and females. There are only 12 values missing. We will impute the NAs with the median of each gender in either risk or no-risk group.

glucose and tot_chol are significantly different especially in people with and without hypertension. We will impute the NAs with the median of each gender in either risk or no-risk group.

bp_meds is a very important variable, and not easy to impute, so we don’t want to run risks here. We see from the table below that the people where blood pressure (bp) medication was not recorded have (median) bp values compatible with people who have either high or low blood pressure AND do not take medications. This could justify the imputation. At the same time, this is a very important variable, associated with drug use. We don’t want to run this risk. We will therefore discard people where it is not known whether they take medications.

cigs_per_day is a variable that can hardly be imputed correctly, therefore we will remove the data points with NA in this variable (less than 1%)

bmi

# bmi ~ male
ggbetweenstats(
   data = train,
   x = male,
   y = bmi
)

glucose

# glucose ~ hypertension
ggbetweenstats(
   data = train,
   x = prevalent_hyp,
   y = glucose
)

cholesterol

# cholesterol ~ hypertension
ggbetweenstats(
   data = train,
   x = prevalent_hyp,
   y = tot_chol
)

bp_meds

# bp_meds ~ prevalent_hyp
df %>% 
  select(sys_bp, prevalent_hyp, bp_meds) %>%
  group_by(bp_meds, prevalent_hyp) %>% 
  summarise(
    median_sys_bp = median(sys_bp),
    count_prevalent_hyp = n()
  ) 
## # A tibble: 5 × 4
## # Groups:   bp_meds [3]
##   bp_meds prevalent_hyp median_sys_bp count_prevalent_hyp
##   <fct>   <fct>                 <dbl>               <int>
## 1 0       0                      122                 2891
## 2 0       1                      150.                1170
## 3 1       1                      165                  124
## 4 <NA>    0                      121                   31
## 5 <NA>    1                      154.                  22

NA imputation on train and test set

# ------------------------- train --------------------------------------------
train <- train %>%
   # impute the bmi as the median of males/females
   group_by(ten_year_chd, male) %>%
   mutate(bmi = ifelse(is.na(bmi), median(bmi, na.rm = T),bmi)) %>%
   ungroup() %>%
   
   # impute glucose and cholesterol level as the median of prevalent_hyp
   group_by(ten_year_chd, prevalent_hyp) %>%
   mutate(glucose = ifelse(is.na(glucose), median(glucose, na.rm=T), glucose)) %>%
   mutate(tot_chol = ifelse(is.na(tot_chol), median(tot_chol, na.rm=T), tot_chol)) %>%
   ungroup() %>% 
   
   # remove people where bp_meds is not unknown
   filter(!is.na(bp_meds)) %>% 
   
   # remove people with no information about education level
   filter(!is.na(education)) %>% 
   
   # # remove the mean from numerical variables
   # mutate_if(is.numeric, ~ . - mean(., na.rm = TRUE)) %>% 
   
   # remove residual NAs
   na.omit()
   
   
   

train %>% 
   select(
      ten_year_chd, age, sys_bp, bmi, glucose, tot_chol, 
      male, prevalent_hyp, bp_meds, education
   ) %>%
   diagnose %>%
   flextable
# ------------------------- test --------------------------------------------
test <- test %>%
   # impute the bmi as the median of males/females
   group_by(ten_year_chd, male) %>%
   mutate(bmi = ifelse(is.na(bmi), median(bmi, na.rm = T),bmi)) %>%
   ungroup() %>%
   
   # impute glucose and cholesterol level as the median of prevalent_hyp
   group_by(ten_year_chd, prevalent_hyp) %>%
   mutate(glucose = ifelse(is.na(glucose), median(glucose, na.rm=T), glucose)) %>%
   mutate(tot_chol = ifelse(is.na(tot_chol), median(tot_chol, na.rm=T), tot_chol)) %>%
   ungroup() %>% 
   
   # remove people where bp_meds is not unknown
   filter(!is.na(bp_meds)) %>% 
   
   # remove people with no information about education level
   filter(!is.na(education)) %>% 
   
   # # remove the mean from numerical variables
   # mutate_if(is.numeric, ~ . - mean(., na.rm = TRUE)) %>% 
   
   # remove residual NAs
   na.omit()




test %>% 
   select(
      ten_year_chd, age, sys_bp, bmi, glucose, tot_chol, 
      male, prevalent_hyp, bp_meds, education
   ) %>%
   diagnose %>%
   flextable

Analytic strategy

It appears that there are two interesting questions:

  1. How well is the risk predicted by generic factors alone? (such as sex, age, and blood pressure)

  2. How much can we increase the prediction if we include other variables more related to hypertension, such as diagnosed hypertension or assumption of blood pressure medications?

Since we allowed some correlated variables to enter the model, we will pay close attention to the VIF (variance inflation factor).

After the “best” model is detemined by manual stepwise regression, we will compare the former to an automatic stepwise regression carried out using the AIC criterion alone.

Logistic regression

Procedure for stepwise modelling

We start by fitting the “maximal” model, which includes all the provided EVs, to have a baseline estimate for the AUC of this model and of the impact of the correlation between variables we previously detected with EDA - for the latter, we will check the VIF.

Then we fit the full model, which includes all the variables we previously selected based on the EDA. This will show how well the hints gathered in the EDA are actually reflected in the significance of the coefficients. Specifically, compared to the “maximal” model.

We will then proceed to eliminate EVs based on whether the coefficients are not sigificant or have a very small effect (in terms of odds ratio). At each step we evaluate the AIC and AUC to detect whether reducing the EVs led to an increase or decrease of the performance.

Finally we will examine the confusion matrix and adjust the threshold for binary prediction in order to have maximum Sensitivity (more on this later).

Format of the summary tables

The coefficients are reported in two forms (at the expense of verbosity):

  • original, non-standardized and non-exponentiated (using summary()): to allow calculation of the odds ratio (via exp()) of 10 year CHD given the actual values of the EVs for a specific person.

  • standardized and exponentiated (using jtools::summ()): to allow comparing the magnitude of the coefficients, to have the confidence intervals and to calculate the VIF.

Results

The AIC was similar across all examined models (between 1887 and 1890), therefore this metric had a small weight in the choice of the final model. Similarly, all the models explained at most 17% of the total variance in the data, which is not a substantial amount.

The VIF was not critical, even in the maximal model, therefore the impact of correlated predictors was minimal or negligible.

The maximal model highlighted the contribution of gender, age, systolic blood pressure, and to a lesser extent the prevalence of hypertension and education. Importantly, the p values for some of these variables was higher than in simpler models, most likely due to collinearity with other variables. The AUC of this model was 0.703.

Our full model highlighted the same variables. In addition reducing the number of EVs increased the significance of most of the selected EVs (especially age and systolic blood pressure). The AUC of this model was 0.705.

We then removed EVs with small or no effect from the full model, specifically blood pressure medications, bmi, cholesterol and education. This led to an AUC of 0.711, and all variables significant.

Further removing the EVs specifying whether the patient had hypertension led to a simpler model and did not substantially change the AUC (0.713).

Remarkably, this model contained only gender, age, systolic blood pressure and glucose blood level, in addition to number of cigarettes smoked per day. That is, two common demographic variables, and two physiologic measurements which can be easily retrieved with blood pressure measurement and a simple blood exam

The increase in odds to develop a CHD in 10 years were as follows for each EV:

  • sex: males have \(exp(0.449) \approx 57\%\) greater odds of CHD risk than females
  • cigarettes per day : \(exp(0.1766) \approx 19\%\) greater odds for any additional 10 cigarettes smoked every day
  • age : the odds of CHD increase \(exp(0.0658) \approx 6.8\%\) every year
  • systolic blood pressure : \(exp(0.0186) \approx 1.8\%\) greater odds for each mmHg
  • glucose : \(exp(0.0075) \approx 0.7\%\) greater odds for each mg

The intercept can be interpreted as the probability of a female being at risk when all predictors are set to 0. In our case we have predictors for age, glucose and blood pressure, which are obviously not zero, therefore this interpretation of the intercept is not really helpful in our case. For the sake of calculation: \(p = \frac{e^{\beta_0}}{1 + e^{\beta_0}} = \frac{exp(-8.643)}{1+exp(-8.643)} = 0.000176\)

It is at this point very interesting to notice that:

  1. Assumption of blood pressure medications does not reduce the 10 years risk of CHD
  2. Having hypertension has only a small effect on the prediction of developing a risk of CHD in 10 years

This is a very interesting result from a practical perspective, since it means that generic demographic and physiological variables are much more predictive of the risk of CHD in 10 years with respect to a diagnosis of hypertension, and that the assumption of blood pressure medicaments - presumably to normalize the blood pressure level in people with hypertension - does not reduce the risk of CHD

Our final model therefore includes:

  • sex
  • age
  • systolic blood pressure
  • glucose
  • cigarettes per day

Afterwards, we evaluated the performance of our model on the test data. Using a default probability of 0.5 for binary prediction leads to a poor Sensitivity, in that only ~ 7% of the people with a real 10 year CHD risk (\(\frac{16}{16+219}\)) are predicted as being at risk.

Since in our case the aim is to maximise the detection of people who are truly at risk - even at expense of inflating the number of false negatives - we lowered the threshold of binary prediction to 0.1. This allowed us to reach a sensitivity of 84%.

The automatic stepwise regression (stepAIC) identifies the same variables as our final model, plus a small (uncorrected significant) contribution of education level and prevalence of hypertension. However, this model has a smaller AUC (0.703) than our final model, and is also more complex. Therefore we retained our final model.

Finally, we hypothesized that the imbalance between Sensitivity and Specificity could be due to the high proportion of people not at risk in our dataset. Therefore, as a last test, we sampled an equal number of people with and without risk.

In this balanced samples model the AUC increased to 0.738. By using a threshold for binary decision of 0.3, this model correctly predicts 91.4% of the people with actual CHD risk, and a false negative rate of 70%. By adopting a more liberal threshold of 0.4, the model predicts 81.3% of the people with actual CHD, and about 50% of the people with no risk of CHD.

A final remark This model captures only a fraction of the total variance in the data (~ 15%). It is odd that other variables widely thought to be associated with cardiovascular diseases were not recorded, such as stress level (e.g. # hours of work per day/week), alcohol comsumption, quality of the air.


do_logistic_regression : a function to carry out the logistic regression and print out the summary of the model, the ROC curve - and the AUC - as well as the confusion matrix.

plot_ROC : a function that generates an interactive ROC that allows to inspect Sensitivity and Specificity for different levels of threshold for binary decision

do_logistic_regression <- function(EVs, train_data = train, test_data = test, thresh = 0.5) {
   
   # Assemble the formula
   formula_string <- paste(EVs, collapse = " + ")
   formula_obj <- as.formula(paste("ten_year_chd ~", formula_string))
   
   fit <- glm(formula_obj, family = "binomial", data = train_data)

   
   # # Standard R summary
   summary(fit) %>% print()
   
   
   # Print a summary
   # summary(fit) %>% print()
   if (length(EVs) > 1) {
      jtools::summ(fit, confint = T, vifs = T, scale = F, exp = T, digits = 2) %>% print()
   } else {
      jtools::summ(fit, confint = T, vifs = T, scale = F, exp = T, digits = 2) %>% print()
   }

   # Estimated probability values from the logistic model
   phat = predict(fit, newdata = test_data, type = "response")

   # Confusion matrix and derived quantities - using caret::confusionmatrix
   sprintf("\n Using a threshold for prediction = %.2f \n\n", thresh) %>% cat()
   
   thresh <- thresh
   predicted <- ifelse(phat > thresh, 1, 0)
   
   actual_predicted <- tibble(
      actual = test_data$ten_year_chd, 
      predicted = ifelse(phat > thresh, 1, 0) 
   )
   
   caret::confusionMatrix(
      as.factor(predicted), # predicted
      as.factor(test_data$ten_year_chd), # actual 
      positive = "1"  # the positive class in our case is "1"
   ) %>% print()
   
   return(phat)
}


plot_ROC <- function(test_data = test, phat = phat) {
 # Define the probability thresholds you want to use
   pthresh <- seq(0, 1, 0.1)
   
   # Create the ROC curve
   roc_curve <- roc(test_data$ten_year_chd, phat, plot = F, print.auc=T)
   
   # Get the coordinates (sensitivity and specificity) for the specified thresholds
   coordinates <- pROC::coords(roc_curve, pthresh)

    # Using highcharter
    h <- coordinates %>% 
    hchart(
        type = "line",
        hcaes(x = specificity, y = sensitivity, threshold = threshold)
    ) %>% 
    hc_xAxis(title = list(text = "Specificity"), reversed = TRUE, min = 0, max = 1) %>%
    hc_yAxis(title = list(text = "Sensitivity"), min = 0, max = 1) %>% 
    hc_add_series(
        data = coordinates,
        type = "scatter",
        hcaes(x = specificity, y = sensitivity, threshold = threshold),
        marker = list(radius = 6, fillColor = "lightblue", lineWidth = 2),
        tooltip = list(
            headerFormat = "",
            pointFormat = "<b>Threshold: {point.threshold:.2f}</b><br>
                           Specificity: {point.x:.2f}<br>
                           Sensitivity: {point.y:.2f}")
    ) %>% 
     hc_chart(aspectRatio = 1) %>%   # Set the equal aspect ratio
     hc_title(text = paste("ROC curve - AUC =", round(roc_curve$auc,3)))
    
    return(h)
}

Manual stepwise

Maximal model

Maximal model, including all the variables in the dataset. We use this only as a benchmark for the models instructed by EDA.

We observe high VIF values in sys_bp and dia_bp - as expected.

EVs <- train %>% select(-ten_year_chd) %>% colnames()
phat <- do_logistic_regression(EVs, thresh = 0.5)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8415  -0.5828  -0.4175  -0.2897   2.8226  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.4497344  0.8561698  -8.701  < 2e-16 ***
## male1              0.3800302  0.1329243   2.859  0.00425 ** 
## age                0.0608765  0.0083022   7.333 2.26e-13 ***
## education2        -0.3243521  0.1517706  -2.137  0.03259 *  
## education3        -0.3327557  0.1860997  -1.788  0.07377 .  
## education4        -0.0502180  0.1969542  -0.255  0.79874    
## current_smoker1    0.1717930  0.1922979   0.893  0.37166    
## cigs_per_day       0.0149836  0.0078998   1.897  0.05787 .  
## bp_meds1           0.3755894  0.2932894   1.281  0.20033    
## prevalent_stroke1  0.8294602  0.6158164   1.347  0.17800    
## prevalent_hyp1     0.3669952  0.1660561   2.210  0.02710 *  
## diabetes1          0.3126526  0.3897172   0.802  0.42241    
## tot_chol           0.0004310  0.0014095   0.306  0.75980    
## sys_bp             0.0121563  0.0046474   2.616  0.00890 ** 
## dia_bp             0.0004736  0.0078561   0.060  0.95193    
## bmi                0.0141265  0.0151335   0.933  0.35058    
## heart_rate        -0.0068913  0.0050928  -1.353  0.17601    
## glucose            0.0063839  0.0028683   2.226  0.02604 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1854.4  on 2453  degrees of freedom
## AIC: 1890.4
## 
## Number of Fisher Scoring iterations: 5
## 
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(17) = 259.91, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.17
## Pseudo-R² (McFadden) = 0.12
## AIC = 1890.45, BIC = 1995.07 
## 
## Standard errors: MLE
## -------------------------------------------------------------------------
##                           exp(Est.)   2.5%   97.5%   z val.      p    VIF
## ----------------------- ----------- ------ ------- -------- ------ ------
## (Intercept)                    0.00   0.00    0.00    -8.70   0.00       
## male1                          1.46   1.13    1.90     2.86   0.00   1.25
## age                            1.06   1.05    1.08     7.33   0.00   1.32
## education2                     0.72   0.54    0.97    -2.14   0.03   1.13
## education3                     0.72   0.50    1.03    -1.79   0.07   1.13
## education4                     0.95   0.65    1.40    -0.25   0.80   1.13
## current_smoker1                1.19   0.81    1.73     0.89   0.37   2.61
## cigs_per_day                   1.02   1.00    1.03     1.90   0.06   2.74
## bp_meds1                       1.46   0.82    2.59     1.28   0.20   1.11
## prevalent_stroke1              2.29   0.69    7.66     1.35   0.18   1.04
## prevalent_hyp1                 1.44   1.04    2.00     2.21   0.03   1.94
## diabetes1                      1.37   0.64    2.93     0.80   0.42   1.88
## tot_chol                       1.00   1.00    1.00     0.31   0.76   1.07
## sys_bp                         1.01   1.00    1.02     2.62   0.01   3.64
## dia_bp                         1.00   0.99    1.02     0.06   0.95   2.92
## bmi                            1.01   0.98    1.04     0.93   0.35   1.23
## heart_rate                     0.99   0.98    1.00    -1.35   0.18   1.10
## glucose                        1.01   1.00    1.01     2.23   0.03   1.89
## -------------------------------------------------------------------------
## 
##  Using a threshold for prediction = 0.50 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1329  213
##          1   16   22
##                                           
##                Accuracy : 0.8551          
##                  95% CI : (0.8367, 0.8721)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 0.3513          
##                                           
##                   Kappa : 0.1249          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.09362         
##             Specificity : 0.98810         
##          Pos Pred Value : 0.57895         
##          Neg Pred Value : 0.86187         
##              Prevalence : 0.14873         
##          Detection Rate : 0.01392         
##    Detection Prevalence : 0.02405         
##       Balanced Accuracy : 0.54086         
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(test, phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Full with selected variables

This is the “full” model with all the variables chosen based on the EDA.

No problems with collinearity.

EVs  <- c(
   "male", "prevalent_hyp", "bp_meds", "education",
   "age", "sys_bp", "bmi", "glucose", "tot_chol", "cigs_per_day"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9398  -0.5882  -0.4186  -0.2876   2.8221  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -7.8924235  0.7224553 -10.924  < 2e-16 ***
## male1           0.3996399  0.1306947   3.058 0.002230 ** 
## prevalent_hyp1  0.3564960  0.1628251   2.189 0.028565 *  
## bp_meds1        0.4381388  0.2883448   1.519 0.128638    
## education2     -0.3319065  0.1512202  -2.195 0.028174 *  
## education3     -0.3409724  0.1858949  -1.834 0.066621 .  
## education4     -0.0373354  0.1959238  -0.191 0.848870    
## age             0.0613141  0.0080512   7.616 2.63e-14 ***
## sys_bp          0.0118211  0.0034909   3.386 0.000709 ***
## bmi             0.0137041  0.0146029   0.938 0.348011    
## glucose         0.0075754  0.0021169   3.579 0.000345 ***
## tot_chol        0.0002287  0.0014069   0.163 0.870891    
## cigs_per_day    0.0191629  0.0052321   3.663 0.000250 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1859.5  on 2458  degrees of freedom
## AIC: 1885.5
## 
## Number of Fisher Scoring iterations: 5
## 
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(12) = 254.88, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.17
## Pseudo-R² (McFadden) = 0.12
## AIC = 1885.48, BIC = 1961.04 
## 
## Standard errors: MLE
## ----------------------------------------------------------------------
##                        exp(Est.)   2.5%   97.5%   z val.      p    VIF
## -------------------- ----------- ------ ------- -------- ------ ------
## (Intercept)                 0.00   0.00    0.00   -10.92   0.00       
## male1                       1.49   1.15    1.93     3.06   0.00   1.21
## prevalent_hyp1              1.43   1.04    1.97     2.19   0.03   1.87
## bp_meds1                    1.55   0.88    2.73     1.52   0.13   1.09
## education2                  0.72   0.53    0.97    -2.19   0.03   1.11
## education3                  0.71   0.49    1.02    -1.83   0.07   1.11
## education4                  0.96   0.66    1.41    -0.19   0.85   1.11
## age                         1.06   1.05    1.08     7.62   0.00   1.25
## sys_bp                      1.01   1.00    1.02     3.39   0.00   2.06
## bmi                         1.01   0.99    1.04     0.94   0.35   1.16
## glucose                     1.01   1.00    1.01     3.58   0.00   1.02
## tot_chol                    1.00   1.00    1.00     0.16   0.87   1.06
## cigs_per_day                1.02   1.01    1.03     3.66   0.00   1.22
## ----------------------------------------------------------------------
## 
##  Using a threshold for prediction = 0.50 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1332  219
##          1   13   16
##                                           
##                Accuracy : 0.8532          
##                  95% CI : (0.8347, 0.8703)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 0.433           
##                                           
##                   Kappa : 0.0915          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.06809         
##             Specificity : 0.99033         
##          Pos Pred Value : 0.55172         
##          Neg Pred Value : 0.85880         
##              Prevalence : 0.14873         
##          Detection Rate : 0.01013         
##    Detection Prevalence : 0.01835         
##       Balanced Accuracy : 0.52921         
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(test, phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Remove no-effect EVs

Remove education, bp_meds, bmi, tot_chol since they have no effect on the model.

The AIC slightly decreases with respect to the full model.

# Remove education, bp_meds, bmi, tot_chol since they have
# no effect on the model
EVs  <- c(
   "male", "prevalent_hyp",
   "age", "sys_bp", "glucose", "cigs_per_day"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1088  -0.5889  -0.4241  -0.2951   2.7635  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -8.027822   0.562910 -14.261  < 2e-16 ***
## male1           0.434934   0.127776   3.404 0.000664 ***
## prevalent_hyp1  0.387263   0.160552   2.412 0.015862 *  
## age             0.064891   0.007775   8.346  < 2e-16 ***
## sys_bp          0.013408   0.003393   3.952 7.76e-05 ***
## glucose         0.007572   0.002106   3.595 0.000324 ***
## cigs_per_day    0.018188   0.005197   3.500 0.000465 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1870.0  on 2464  degrees of freedom
## AIC: 1884
## 
## Number of Fisher Scoring iterations: 5
## 
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(6) = 244.31, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.12
## AIC = 1884.05, BIC = 1924.73 
## 
## Standard errors: MLE
## ----------------------------------------------------------------------
##                        exp(Est.)   2.5%   97.5%   z val.      p    VIF
## -------------------- ----------- ------ ------- -------- ------ ------
## (Intercept)                 0.00   0.00    0.00   -14.26   0.00       
## male1                       1.54   1.20    1.98     3.40   0.00   1.16
## prevalent_hyp1              1.47   1.08    2.02     2.41   0.02   1.83
## age                         1.07   1.05    1.08     8.35   0.00   1.17
## sys_bp                      1.01   1.01    1.02     3.95   0.00   1.95
## glucose                     1.01   1.00    1.01     3.60   0.00   1.01
## cigs_per_day                1.02   1.01    1.03     3.50   0.00   1.21
## ----------------------------------------------------------------------
## 
##  Using a threshold for prediction = 0.50 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1335  219
##          1   10   16
##                                           
##                Accuracy : 0.8551          
##                  95% CI : (0.8367, 0.8721)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 0.3513          
##                                           
##                   Kappa : 0.0958          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.06809         
##             Specificity : 0.99257         
##          Pos Pred Value : 0.61538         
##          Neg Pred Value : 0.85907         
##              Prevalence : 0.14873         
##          Detection Rate : 0.01013         
##    Detection Prevalence : 0.01646         
##       Balanced Accuracy : 0.53033         
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(test, phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Remove prevalent_hyp

Remove prevalent_hyp since it is only marginally significant

Note that the AIC only marginally increases (less than 0.2%) and this model is much simpler and requires only 4 variables, two of which are common demographic variables (age and sex), while the other two can be easily gathered through a pressure measurement and a blood test.

# Remove prevalent_hyp since it is only marginally significant
EVs  <- c(
   "male", "age", "sys_bp", "glucose", "cigs_per_day"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1053  -0.5899  -0.4330  -0.2982   2.8162  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.643579   0.504565 -17.131  < 2e-16 ***
## male1         0.449590   0.127385   3.529 0.000417 ***
## age           0.065845   0.007745   8.502  < 2e-16 ***
## sys_bp        0.018684   0.002597   7.194 6.29e-13 ***
## glucose       0.007547   0.002112   3.573 0.000353 ***
## cigs_per_day  0.017858   0.005184   3.445 0.000571 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1875.8  on 2465  degrees of freedom
## AIC: 1887.8
## 
## Number of Fisher Scoring iterations: 5
## 
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(5) = 238.54, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1887.82, BIC = 1922.69 
## 
## Standard errors: MLE
## --------------------------------------------------------------------
##                      exp(Est.)   2.5%   97.5%   z val.      p    VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept)               0.00   0.00    0.00   -17.13   0.00       
## male1                     1.57   1.22    2.01     3.53   0.00   1.16
## age                       1.07   1.05    1.08     8.50   0.00   1.17
## sys_bp                    1.02   1.01    1.02     7.19   0.00   1.14
## glucose                   1.01   1.00    1.01     3.57   0.00   1.01
## cigs_per_day              1.02   1.01    1.03     3.44   0.00   1.20
## --------------------------------------------------------------------
## 
##  Using a threshold for prediction = 0.50 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1337  219
##          1    8   16
##                                           
##                Accuracy : 0.8563          
##                  95% CI : (0.8381, 0.8733)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 0.3             
##                                           
##                   Kappa : 0.0987          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.06809         
##             Specificity : 0.99405         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.85925         
##              Prevalence : 0.14873         
##          Detection Rate : 0.01013         
##    Detection Prevalence : 0.01519         
##       Balanced Accuracy : 0.53107         
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(test, phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Assess categorical interactions

The number of cigarettes per day is a relevant factor, although being a current smoker is not. Since there is an interaction between being male, current smoker and having hypertension, it is worth exploring whether including these interactions would improve the performance of the model.

However, the results show that the coefficients associated with male:current_smoker and current_smoker:prevalent_hyp are not significant - especially after correction. Also, they are highly collinear with cigs_per_day - which becomes not significant - and do not improve the model performance.

For all these resasons, these interactions will not be included in the final model.

# Try to include `male:current_smoker` and `current_smoker:prevalent_hyp`
EVs  <- c(
   "male", "age", "sys_bp", "glucose", "male:current_smoker", "current_smoker:prevalent_hyp"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0959  -0.5871  -0.4262  -0.2940   2.7761  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -8.071834   0.575043 -14.037  < 2e-16 ***
## male1                           0.411137   0.177391   2.318 0.020466 *  
## age                             0.064531   0.007790   8.284  < 2e-16 ***
## sys_bp                          0.013450   0.003398   3.959 7.54e-05 ***
## glucose                         0.007486   0.002101   3.563 0.000367 ***
## male0:current_smoker1           0.386439   0.213094   1.813 0.069760 .  
## male1:current_smoker1           0.560270   0.206397   2.715 0.006637 ** 
## current_smoker0:prevalent_hyp1  0.472103   0.200758   2.352 0.018693 *  
## current_smoker1:prevalent_hyp1  0.287916   0.201998   1.425 0.154058    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1871.3  on 2462  degrees of freedom
## AIC: 1889.3
## 
## Number of Fisher Scoring iterations: 5
## 
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(8) = 243.03, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1889.32, BIC = 1941.64 
## 
## Standard errors: MLE
## -------------------------------------------------------------------------------
##                                        exp(Est.)   2.5%   97.5%   z val.      p
## ------------------------------------ ----------- ------ ------- -------- ------
## (Intercept)                                 0.00   0.00    0.00   -14.04   0.00
## male1                                       1.51   1.07    2.14     2.32   0.02
## age                                         1.07   1.05    1.08     8.28   0.00
## sys_bp                                      1.01   1.01    1.02     3.96   0.00
## glucose                                     1.01   1.00    1.01     3.56   0.00
## male0:current_smoker1                       1.47   0.97    2.23     1.81   0.07
## male1:current_smoker1                       1.75   1.17    2.62     2.71   0.01
## current_smoker0:prevalent_hyp1              1.60   1.08    2.38     2.35   0.02
## current_smoker1:prevalent_hyp1              1.33   0.90    1.98     1.43   0.15
## -------------------------------------------------------------------------------
##  
## -------------------------------------------
##                                         VIF
## ------------------------------------ ------
## (Intercept)                                
## male1                                  2.24
## age                                    1.18
## sys_bp                                 1.96
## glucose                                1.01
## male0:current_smoker1                  4.18
## male1:current_smoker1                  4.18
## current_smoker0:prevalent_hyp1         3.57
## current_smoker1:prevalent_hyp1         3.57
## -------------------------------------------
## 
##  Using a threshold for prediction = 0.50 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1332  221
##          1   13   14
##                                           
##                Accuracy : 0.8519          
##                  95% CI : (0.8334, 0.8691)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 0.4892          
##                                           
##                   Kappa : 0.0786          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.059574        
##             Specificity : 0.990335        
##          Pos Pred Value : 0.518519        
##          Neg Pred Value : 0.857695        
##              Prevalence : 0.148734        
##          Detection Rate : 0.008861        
##    Detection Prevalence : 0.017089        
##       Balanced Accuracy : 0.524955        
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(test, phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Sigmoid for age

We also decided to try to model age as a sigmoid, under the assumption that the odds of a CHD risk in ten years were overall smaller for people < 40 and would increase at a steeper rate afterwards, however the perfomance of the model did not change.

sigmoid_function <- function(x, a, b, c, d) {
  # Sigmoid function formula: a + (b - a) / (1 + exp(-c * (x - d)))
  return(a + (b - a) / (1 + exp(-c * (x - d))))
}

# Define the parameters for the sigmoid function to control the curve
a_linear <- 0      # Start value (linear increase starts from 0)
b_linear <- 1      # End value of linear increase (1 represents 100%)
c_steepness <- 0.15 # Steepness factor for the sigmoid curve (adjust as needed)
d_transition <- 50 # Age at which the transition occurs (50 in your case)


train_sigmoid_age <- train %>% 
 mutate(age_sigmoid = sigmoid_function(age, a_linear, b_linear, c_steepness, d_transition))

test_sigmoid_age <- test %>% 
 mutate(age_sigmoid = sigmoid_function(age, a_linear, b_linear, c_steepness, d_transition))


plot(train_sigmoid_age$age, train_sigmoid_age$age_sigmoid)

# The proxy_age variable now has a smooth transition, increasing linearly until 50 and more steeply after that.



EVs  <- c(
   "male", "age_sigmoid", "sys_bp", "glucose", "cigs_per_day"
)

phat <- do_logistic_regression(
 train_data = train_sigmoid_age,
 test_data = test_sigmoid_age,
 EVs, thresh = 0.1
)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0739  -0.5929  -0.4279  -0.2942   2.7784  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -6.433353   0.401903 -16.007  < 2e-16 ***
## male1         0.449367   0.127351   3.529 0.000418 ***
## age_sigmoid   2.197337   0.258184   8.511  < 2e-16 ***
## sys_bp        0.018622   0.002598   7.167 7.64e-13 ***
## glucose       0.007502   0.002109   3.556 0.000376 ***
## cigs_per_day  0.017897   0.005184   3.452 0.000556 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1874.9  on 2465  degrees of freedom
## AIC: 1886.9
## 
## Number of Fisher Scoring iterations: 5
## 
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(5) = 239.49, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1886.87, BIC = 1921.74 
## 
## Standard errors: MLE
## --------------------------------------------------------------------
##                      exp(Est.)   2.5%   97.5%   z val.      p    VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept)               0.00   0.00    0.00   -16.01   0.00       
## male1                     1.57   1.22    2.01     3.53   0.00   1.16
## age_sigmoid               9.00   5.43   14.93     8.51   0.00   1.17
## sys_bp                    1.02   1.01    1.02     7.17   0.00   1.14
## glucose                   1.01   1.00    1.01     3.56   0.00   1.01
## cigs_per_day              1.02   1.01    1.03     3.45   0.00   1.20
## --------------------------------------------------------------------
## 
##  Using a threshold for prediction = 0.10 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 584  39
##          1 761 196
##                                           
##                Accuracy : 0.4937          
##                  95% CI : (0.4687, 0.5186)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1183          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8340          
##             Specificity : 0.4342          
##          Pos Pred Value : 0.2048          
##          Neg Pred Value : 0.9374          
##              Prevalence : 0.1487          
##          Detection Rate : 0.1241          
##    Detection Prevalence : 0.6057          
##       Balanced Accuracy : 0.6341          
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(test_sigmoid_age, phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Final model using thr = 0.5

We decided to include the following variables in the final model:

  • male
  • age
  • sys_bp
  • glucose
  • cigs_per_day

At this point, we evaluate the performance of the model in terms of sensitivity (TP/TP+FN) using a default threshold of 0.5 for binary decision on the test set.

Our aim is to maximise the amount of people at risk which are predicted to be so by the model.

# The main aim is to use the model to identify the highest amount of true positives
# and minimize false negatives

EVs  <- c(
   "male", "age", "sys_bp", "glucose", "cigs_per_day"
)

phat <- do_logistic_regression(EVs, thresh = 0.5)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1053  -0.5899  -0.4330  -0.2982   2.8162  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.643579   0.504565 -17.131  < 2e-16 ***
## male1         0.449590   0.127385   3.529 0.000417 ***
## age           0.065845   0.007745   8.502  < 2e-16 ***
## sys_bp        0.018684   0.002597   7.194 6.29e-13 ***
## glucose       0.007547   0.002112   3.573 0.000353 ***
## cigs_per_day  0.017858   0.005184   3.445 0.000571 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1875.8  on 2465  degrees of freedom
## AIC: 1887.8
## 
## Number of Fisher Scoring iterations: 5
## 
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(5) = 238.54, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1887.82, BIC = 1922.69 
## 
## Standard errors: MLE
## --------------------------------------------------------------------
##                      exp(Est.)   2.5%   97.5%   z val.      p    VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept)               0.00   0.00    0.00   -17.13   0.00       
## male1                     1.57   1.22    2.01     3.53   0.00   1.16
## age                       1.07   1.05    1.08     8.50   0.00   1.17
## sys_bp                    1.02   1.01    1.02     7.19   0.00   1.14
## glucose                   1.01   1.00    1.01     3.57   0.00   1.01
## cigs_per_day              1.02   1.01    1.03     3.44   0.00   1.20
## --------------------------------------------------------------------
## 
##  Using a threshold for prediction = 0.50 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1337  219
##          1    8   16
##                                           
##                Accuracy : 0.8563          
##                  95% CI : (0.8381, 0.8733)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 0.3             
##                                           
##                   Kappa : 0.0987          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.06809         
##             Specificity : 0.99405         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.85925         
##              Prevalence : 0.14873         
##          Detection Rate : 0.01013         
##    Detection Prevalence : 0.01519         
##       Balanced Accuracy : 0.53107         
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(test, phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Final model using thr = 0.1

Leaving the default threshold of 0.5 for binary prediction leads to identifying too few true positives. For this reason, we decided to maximise the Sensitivity (TP / (TP+FN)) even at the expense of assigning high risk of CHD to some people who are not.

This rationale is motivated by the fact that in this case the decision is not about doing or not doing an open-heart surgery, but rather to start to monitor people who might be at high risk. Plus, according the to the model, an accurate monitoring only requires inexpensive procedures: blood test and pressure measurements.

For this reason, it is of paramount important to include all the people who might be at risk even if they will turn out not to be so.

The threshold for risk detection is accordingly lowered to 0.1.

This leads to determine that almost 50% of the people who are not at risk will be deemed to be actually at risk. However, this choice will lead to a Sensitivity of about 80%.

NB: We also replace the cigs_per_day with units of tens of cigarettes

# The main aim is to use the model to identify the highest amount of true positives
# and minimize false negatives

# EVs  <- c(
#    "male", "age", "sys_bp", "glucose", "cigs_per_day"
# )
# 
# phat <- do_logistic_regression(EVs, thresh = 0.1)


train_cig_factor <- train %>% 
 mutate(tens_cigs = round(cigs_per_day/10,0))

test_cig_factor <- test %>% 
 mutate(tens_cigs = round(cigs_per_day/10,0))


EVs  <- c(
   "male", "age", "sys_bp", "glucose", "tens_cigs"
)

phat <- do_logistic_regression(
 EVs, 
 train_data = train_cig_factor, 
 test_data = test_cig_factor, 
 thresh = 0.1
)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1007  -0.5886  -0.4341  -0.2981   2.8165  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.636592   0.503982 -17.137  < 2e-16 ***
## male1        0.448757   0.127401   3.522 0.000428 ***
## age          0.065813   0.007742   8.501  < 2e-16 ***
## sys_bp       0.018675   0.002597   7.191 6.45e-13 ***
## glucose      0.007510   0.002108   3.562 0.000368 ***
## tens_cigs    0.176640   0.051338   3.441 0.000580 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1875.8  on 2465  degrees of freedom
## AIC: 1887.8
## 
## Number of Fisher Scoring iterations: 5
## 
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(5) = 238.54, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1887.81, BIC = 1922.69 
## 
## Standard errors: MLE
## -------------------------------------------------------------------
##                     exp(Est.)   2.5%   97.5%   z val.      p    VIF
## ----------------- ----------- ------ ------- -------- ------ ------
## (Intercept)              0.00   0.00    0.00   -17.14   0.00       
## male1                    1.57   1.22    2.01     3.52   0.00   1.16
## age                      1.07   1.05    1.08     8.50   0.00   1.17
## sys_bp                   1.02   1.01    1.02     7.19   0.00   1.14
## glucose                  1.01   1.00    1.01     3.56   0.00   1.01
## tens_cigs                1.19   1.08    1.32     3.44   0.00   1.20
## -------------------------------------------------------------------
## 
##  Using a threshold for prediction = 0.10 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 588  38
##          1 757 197
##                                           
##                Accuracy : 0.4968          
##                  95% CI : (0.4719, 0.5218)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1218          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8383          
##             Specificity : 0.4372          
##          Pos Pred Value : 0.2065          
##          Neg Pred Value : 0.9393          
##              Prevalence : 0.1487          
##          Detection Rate : 0.1247          
##    Detection Prevalence : 0.6038          
##       Balanced Accuracy : 0.6377          
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(test, phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Code to manually calculate the Confusion Matrix and its derivatives. Values are not shown. This was just an excercise.

# # Binary prediction from the estimated probability values 
# thresh <- 0.1
# predicted <- ifelse(phat > thresh, 1, 0)
# 
# actual_predicted <- tibble(
#    actual = test$ten_year_chd, 
#    predicted = ifelse(phat > thresh, 1, 0) 
# )
# 
# cm <- actual_predicted %>%
#   summarise(
#     TP = sum(actual == 1 & predicted == 1),
#     TN = sum(actual == 0 & predicted == 0),
#     FP = sum(actual == 0 & predicted == 1),
#     FN = sum(actual == 1 & predicted == 0)
#   )
# 
# cm %>% print
# 
# TP <- cm$TP
# TN <- cm$TN
# FP <- cm$FP
# FN <- cm$FN
# 
# 
# accuracy <- (TP + TN) / (TP+TN+FP+FN)
# print(paste0("Accuracy = ", round(accuracy,4)))
# 
# sensitivity <- TP / (TP + FN)
# print(paste0("Sensitivity / Recall / TPR = ", round(sensitivity,4)))
# 
# specificity <- TN / (TN + FP)
# print(paste0("Specificity / TNR = ", round(specificity,4)))
# 
# PPV <- TP / (TP+FP)
# print(paste0("Positive Predictive Value = ", round(PPV,2)))
# 
# NPV <- TN / (TN+FN)
# print(paste0("Negative Predictive Value = ", round(NPV,2)))

Automatic stepwise

This particular stepwise regression leads to a much more complex model. It is important to note that the particular order of the predictors can change the final estimated “best” model (based on AIC).

However what is very important is that the cigs_per_day is identified as an important predictor, despite our EDA repeatedly disconfirmed so. The reasons are therefore to be explored, however for the time being we will try to include it into our manually built model (see above.)

library(MASS)

fullModel = glm(
   ten_year_chd ~ male + prevalent_hyp + bp_meds + age + sys_bp + glucose + tot_chol,
   family = "binomial", data = train
)

fullModel = glm(ten_year_chd ~ ., family = "binomial", data = train)

nullModel = glm(ten_year_chd ~ 1, family = "binomial", data = train)

step_fit <- stepAIC(
   fullModel, direction = 'backward', 
   scope = list(upper = fullModel, lower = nullModel), trace = 0
)

step_fit %>% summary()
## 
## Call:
## glm(formula = ten_year_chd ~ male + age + education + cigs_per_day + 
##     prevalent_stroke + prevalent_hyp + sys_bp + glucose, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0023  -0.5868  -0.4197  -0.2915   2.7940  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -7.655170   0.587371 -13.033  < 2e-16 ***
## male1              0.387895   0.129655   2.992 0.002774 ** 
## age                0.060750   0.007983   7.610 2.74e-14 ***
## education2        -0.335970   0.150240  -2.236 0.025337 *  
## education3        -0.352595   0.184679  -1.909 0.056232 .  
## education4        -0.038528   0.195294  -0.197 0.843606    
## cigs_per_day       0.019113   0.005227   3.657 0.000256 ***
## prevalent_stroke1  0.969730   0.601408   1.612 0.106868    
## prevalent_hyp1     0.372800   0.161896   2.303 0.021295 *  
## sys_bp             0.013282   0.003412   3.893 9.92e-05 ***
## glucose            0.007787   0.002106   3.698 0.000217 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1860.2  on 2460  degrees of freedom
## AIC: 1882.2
## 
## Number of Fisher Scoring iterations: 5
step_fit %>% jtools::summ(confint = T, vifs = T, digits = 2) %>% print()
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(10) = 254.11, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.17
## Pseudo-R² (McFadden) = 0.12
## AIC = 1882.24, BIC = 1946.18 
## 
## Standard errors: MLE
## ----------------------------------------------------------------------
##                            Est.    2.5%   97.5%   z val.      p    VIF
## ----------------------- ------- ------- ------- -------- ------ ------
## (Intercept)               -7.66   -8.81   -6.50   -13.03   0.00       
## male1                      0.39    0.13    0.64     2.99   0.00   1.19
## age                        0.06    0.05    0.08     7.61   0.00   1.23
## education2                -0.34   -0.63   -0.04    -2.24   0.03   1.09
## education3                -0.35   -0.71    0.01    -1.91   0.06   1.09
## education4                -0.04   -0.42    0.34    -0.20   0.84   1.09
## cigs_per_day               0.02    0.01    0.03     3.66   0.00   1.22
## prevalent_stroke1          0.97   -0.21    2.15     1.61   0.11   1.01
## prevalent_hyp1             0.37    0.06    0.69     2.30   0.02   1.85
## sys_bp                     0.01    0.01    0.02     3.89   0.00   1.97
## glucose                    0.01    0.00    0.01     3.70   0.00   1.02
## ----------------------------------------------------------------------
# Estimated probability values from the logistic model
phat_step_fit = predict(step_fit, newdata = test, type = "response")

Evaluate predictivity of the stepwise model

# Binary prediction from the estimated probability values 
thresh <- 0.1
predicted <- ifelse(phat_step_fit > thresh, 1, 0)

actual_predicted <- tibble(
   actual = test$ten_year_chd, 
   predicted = ifelse(phat_step_fit > thresh, 1, 0) 
)

caret::confusionMatrix(
   as.factor(predicted), # predicted
   as.factor(test$ten_year_chd), # actual 
   positive = "1"  # the positive class in our case is "1"
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 608  38
##          1 737 197
##                                           
##                Accuracy : 0.5095          
##                  95% CI : (0.4845, 0.5344)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1304          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8383          
##             Specificity : 0.4520          
##          Pos Pred Value : 0.2109          
##          Neg Pred Value : 0.9412          
##              Prevalence : 0.1487          
##          Detection Rate : 0.1247          
##    Detection Prevalence : 0.5911          
##       Balanced Accuracy : 0.6452          
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(test, phat_step_fit)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Samples balanced for CHD risk

Our sample is very unbalanced. We will select an equal number of people with and without risk in both train and test set to assess whether this will help model performance.

set.seed(9999)

min_train <- min(
   nrow(train[train$ten_year_chd == 1,]),
   nrow(train[train$ten_year_chd == 0,])   
)

balanced_train <- train %>% 
   group_by(ten_year_chd) %>% 
   slice_sample(n = min_train)


min_test <- min(
   nrow(test[test$ten_year_chd == 1,]),
   nrow(test[test$ten_year_chd == 0,])   
)

balanced_test <- test %>% 
   group_by(ten_year_chd) %>% 
   slice_sample(n = min_test)


# Full model with the selected variables
EVs  <- c(
   "male", "prevalent_hyp", "bp_meds", "education",
   "age", "sys_bp", "bmi", "glucose", "tot_chol"
)


# Remove education, bp_meds, bmi, tot_chol since they have
# no effect on the model
EVs  <- c(
   "male", "prevalent_hyp",
   "age", "sys_bp", "glucose"
)


# Remove prevalent_hyp since it is only marginally significant
EVs  <- c(
   "male", "age", "sys_bp", "glucose"
)


# The stepwise regression - below - identified cigs_per_day as an important factor.
# I will therefore add it to the model.
EVs  <- c(
   "male", "age", "sys_bp", "glucose", "cigs_per_day"
)


phat <- do_logistic_regression(
   EVs, 
   train_data = balanced_train, test_data = balanced_test,
   thresh = 0.4
)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1032  -1.0034  -0.1717   1.0161   2.0910  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.238967   0.720868 -10.042  < 2e-16 ***
## male1         0.319629   0.169510   1.886  0.05935 .  
## age           0.074519   0.010682   6.976 3.04e-12 ***
## sys_bp        0.015708   0.003688   4.259 2.06e-05 ***
## glucose       0.010113   0.003923   2.578  0.00994 ** 
## cigs_per_day  0.030757   0.007706   3.991 6.57e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1048.04  on 755  degrees of freedom
## Residual deviance:  910.19  on 750  degrees of freedom
## AIC: 922.19
## 
## Number of Fisher Scoring iterations: 4
## 
## MODEL INFO:
## Observations: 756
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(5) = 137.85, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.22
## Pseudo-R² (McFadden) = 0.13
## AIC = 922.19, BIC = 949.96 
## 
## Standard errors: MLE
## --------------------------------------------------------------------
##                      exp(Est.)   2.5%   97.5%   z val.      p    VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept)               0.00   0.00    0.00   -10.04   0.00       
## male1                     1.38   0.99    1.92     1.89   0.06   1.12
## age                       1.08   1.06    1.10     6.98   0.00   1.18
## sys_bp                    1.02   1.01    1.02     4.26   0.00   1.11
## glucose                   1.01   1.00    1.02     2.58   0.01   1.01
## cigs_per_day              1.03   1.02    1.05     3.99   0.00   1.20
## --------------------------------------------------------------------
## 
##  Using a threshold for prediction = 0.40 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 115  44
##          1 120 191
##                                           
##                Accuracy : 0.6511          
##                  95% CI : (0.6061, 0.6941)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 2.818e-11       
##                                           
##                   Kappa : 0.3021          
##                                           
##  Mcnemar's Test P-Value : 4.727e-09       
##                                           
##             Sensitivity : 0.8128          
##             Specificity : 0.4894          
##          Pos Pred Value : 0.6141          
##          Neg Pred Value : 0.7233          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4064          
##    Detection Prevalence : 0.6617          
##       Balanced Accuracy : 0.6511          
##                                           
##        'Positive' Class : 1               
## 
plot_ROC(balanced_test, phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

CHD Prediction app

Use the following model - estimated on the whole dataset - to predict the 10-year risk of CHD according to the given input.

\[ logOdds_{10yr-CHD-risk} = -8.671 + 0.51*male + 0.061*age + 0.0173*sys\_bp + 0.00758*glucose + 0.0204 * cigs\_per\_day \]

\[ p = \frac{e^{logOdds}}{1 + e^{logOdds}} \]

NB: only for illustrational purposes. Read and accept the following Disclaimer first

do_logr <- function(EVs, train_data = train, test_data = test, thresh = 0.5) {
   
   # Assemble the formula
   formula_string <- paste(EVs, collapse = " + ")
   formula_obj <- as.formula(paste("ten_year_chd ~", formula_string))
   
   fit <- glm(formula_obj, family = "binomial", data = train_data)

   
   # # Standard R summary
   summary(fit) %>% print()
   
   
   # Print a summary
   # summary(fit) %>% print()
   if (length(EVs) > 1) {
      jtools::summ(fit, confint = T, vifs = T, scale = F, exp = T, digits = 2) %>% print()
   } else {
      jtools::summ(fit, confint = T, vifs = T, scale = F, exp = T, digits = 2) %>% print()
   }

   # Estimated probability values from the logistic model
   phat = predict(fit, newdata = test_data, type = "response")

   # Confusion matrix and derived quantities - using caret::confusionmatrix
   sprintf("\n Using a threshold for prediction = %.2f \n\n", thresh) %>% cat()
   
   thresh <- thresh
   predicted <- ifelse(phat > thresh, 1, 0)
   
   actual_predicted <- tibble(
      actual = test_data$ten_year_chd, 
      predicted = ifelse(phat > thresh, 1, 0) 
   )
   
   caret::confusionMatrix(
      as.factor(predicted), # predicted
      as.factor(test_data$ten_year_chd), # actual 
      positive = "1"  # the positive class in our case is "1"
   ) %>% print()
   
   return(phat)
}
EVs  <- c(
   "male", "age", "sys_bp", "glucose", "cigs_per_day"
)

phat <- do_logr(EVs, thresh = 0.5)
## 
## Call:
## glm(formula = formula_obj, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1053  -0.5899  -0.4330  -0.2982   2.8162  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -8.643579   0.504565 -17.131  < 2e-16 ***
## male1         0.449590   0.127385   3.529 0.000417 ***
## age           0.065845   0.007745   8.502  < 2e-16 ***
## sys_bp        0.018684   0.002597   7.194 6.29e-13 ***
## glucose       0.007547   0.002112   3.573 0.000353 ***
## cigs_per_day  0.017858   0.005184   3.445 0.000571 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2114.4  on 2470  degrees of freedom
## Residual deviance: 1875.8  on 2465  degrees of freedom
## AIC: 1887.8
## 
## Number of Fisher Scoring iterations: 5
## 
## MODEL INFO:
## Observations: 2471
## Dependent Variable: ten_year_chd
## Type: Generalized linear model
##   Family: binomial 
##   Link function: logit 
## 
## MODEL FIT:
## χ²(5) = 238.54, p = 0.00
## Pseudo-R² (Cragg-Uhler) = 0.16
## Pseudo-R² (McFadden) = 0.11
## AIC = 1887.82, BIC = 1922.69 
## 
## Standard errors: MLE
## --------------------------------------------------------------------
##                      exp(Est.)   2.5%   97.5%   z val.      p    VIF
## ------------------ ----------- ------ ------- -------- ------ ------
## (Intercept)               0.00   0.00    0.00   -17.13   0.00       
## male1                     1.57   1.22    2.01     3.53   0.00   1.16
## age                       1.07   1.05    1.08     8.50   0.00   1.17
## sys_bp                    1.02   1.01    1.02     7.19   0.00   1.14
## glucose                   1.01   1.00    1.01     3.57   0.00   1.01
## cigs_per_day              1.02   1.01    1.03     3.44   0.00   1.20
## --------------------------------------------------------------------
## 
##  Using a threshold for prediction = 0.50 
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1337  219
##          1    8   16
##                                           
##                Accuracy : 0.8563          
##                  95% CI : (0.8381, 0.8733)
##     No Information Rate : 0.8513          
##     P-Value [Acc > NIR] : 0.3             
##                                           
##                   Kappa : 0.0987          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.06809         
##             Specificity : 0.99405         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.85925         
##              Prevalence : 0.14873         
##          Detection Rate : 0.01013         
##    Detection Prevalence : 0.01519         
##       Balanced Accuracy : 0.53107         
##                                           
##        'Positive' Class : 1               
## 
plot_ROC <- function(test_data = test, phat = phat) {
 # Define the probability thresholds you want to use
   pthresh <- seq(0, 1, 0.1)
   
   # Create the ROC curve
   roc_curve <- roc(test_data$ten_year_chd, phat, plot = F, print.auc=T)
   
   # Get the coordinates (sensitivity and specificity) for the specified thresholds
   coordinates <- pROC::coords(roc_curve, pthresh)

    # Using highcharter
    h <- coordinates %>% 
    hchart(
        type = "line",
        hcaes(x = specificity, y = sensitivity, threshold = threshold)
    ) %>% 
    hc_xAxis(title = list(text = "Specificity"), reversed = TRUE, min = 0, max = 1) %>%
    hc_yAxis(title = list(text = "Sensitivity"), min = 0, max = 1) %>% 
    hc_add_series(
        data = coordinates,
        type = "scatter",
        hcaes(x = specificity, y = sensitivity, threshold = threshold),
        marker = list(radius = 6, fillColor = "lightblue", lineWidth = 2),
        tooltip = list(
            headerFormat = "",
            pointFormat = "<b>Threshold: {point.threshold:.2f}</b><br>
                           Specificity: {point.x:.2f}<br>
                           Sensitivity: {point.y:.2f}")
    ) %>% 
     hc_chart(aspectRatio = 1) %>%   # Set the equal aspect ratio
     hc_title(text = paste("ROC curve - AUC =", round(roc_curve$auc,3)))
    
    return(h)
}


plot_ROC(test_data = test, phat = phat)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases